Before we do any of this, we need to load the libraries we will use today:
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
library(tmap)
library(sp)
library(spdep)
## Loading required package: Matrix
## Loading required package: spData
## Warning: package 'spData' was built under R version 3.4.4
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source'))`
There are also some new libraries we will use. If you don’t already have these you will need to install them.
You might have noticed that your list of installed packages contains more than you remember downloading. The idea of dependencies has come up throughout the semester. A package has dependencies when its code relies on (uses code from) another package. For example, if I write some code that I think will be useful and release it as the package “rekaR”, but I use ggplot2 in that code, then ggplot2 is a dependency of rekaR. By default, R installs all the dependencies of a package when you install it. This is how you can end up with packages that you didn’t realise you had.
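If you want to see dependencies for yourself, here is a minimal sketch using only tools that ship with R. It looks up what the stats package depends on among the packages you already have installed. I picked stats purely for illustration because every R installation has it; swap in any package you have installed.

```r
# List the packages that a given package depends on, using only what is
# already installed (no internet connection needed).
installed <- installed.packages()

# "stats" is used here only because it ships with every R installation;
# replace it with any installed package name, e.g. "ggplot2".
deps <- tools::package_dependencies("stats", db = installed)

print(deps)
```

The result is a named list with one entry per package you asked about, each holding the names of the packages it depends on.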
Why am I telling you this? Well, you should always check whether you already have a package before installing it. And I wanted to share with you some neat code PROPERLY REFERENCE SOURCE HERE to do this. I’ll comment it a bit so you can follow what it does, but you don’t have to if you don’t want to. This is just an optional extra.
So as a first step, you have to assign the list of packages you want to check to an object. Let’s say I tell you that today we will be using the following packages: “sp”, “rgdal”, “classInt”, “RColorBrewer”, “ggplot2”, “hexbin”, “ggmap”, “XML”, and “dplyr”. Then you can add these to an object called libs, using the c() function:
libs <- c("sp", "rgdal", "classInt", "RColorBrewer", "ggplot2", "hexbin", "ggmap", "XML", "dplyr")
Now you can run the code below. In the console you will see which packages are and aren’t installed, and any that are missing will be installed for you!
for (x in libs) { #cycle through each item in libs object
  if (x %in% rownames(installed.packages()) == FALSE) { #if the package is not installed
    print(paste0("installing ", x, "...")) #print a message to tell me
    install.packages(x) #and then install the package
  } else { #otherwise (if it is installed)
    print(paste0(x, " is already installed ")) #print a message to tell me
  }
  library(x, character.only = TRUE) #and then load the package
}
## [1] "sp is already installed "
## [1] "rgdal is already installed "
## rgdal: version: 1.2-16, (SVN revision 701)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/proj
## Linking to sp version: 1.2-5
## [1] "classInt is already installed "
## [1] "RColorBrewer is already installed "
## [1] "ggplot2 is already installed "
## [1] "hexbin is already installed "
## [1] "ggmap is already installed "
## [1] "XML is already installed "
## [1] "dplyr is already installed "
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
As you can see if you read through the comments, this bit of code checks each package in the list you passed to the libs object when you created it. If a package is not installed, it installs it for you; either way, it then loads the package. It can be a handy bit of code to keep around.
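If you only want to know which packages are missing, without looping, a shorter variant is possible. This is just a sketch: missing_packages is a name I made up for the helper, and the last entry in libs is a deliberately fake package name so that there is something to find.

```r
# Return the packages in `pkgs` that are not yet installed.
# `missing_packages` is a hypothetical helper name, not part of any package.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Two packages that ship with R, plus one made-up name for illustration
libs <- c("stats", "utils", "aPackageThatDoesNotExist")
to_install <- missing_packages(libs)
print(to_install)

# Uncomment to install everything that is missing in a single call:
# install.packages(to_install)
```

Because install.packages() accepts a vector of names, you can pass it the whole result in one go rather than installing one package at a time.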
Then we will bring back the data from last week:
## R on Windows can have some problems with https addresses, which is why we need to do this first:
urlfile<-'https://s3.amazonaws.com/geoda/data/ncovr.zip'
download.file(urlfile, 'ncovr.zip')
#Let's unzip and create a new directory (ncovr) in our working directory to place the files
unzip('ncovr.zip', exdir = 'ncovr')
Last week we did not treat the data as spatial and, consequently, relied on the csv file. But notice that the unzipped ncovr folder also contains a shapefile that we can load into R as a spatial object:
shp_name <- "ncovr/ncovr/NAT.shp"
ncovr_sf <- st_read(shp_name)
## Reading layer `NAT' from data source `/Users/reka/Dropbox (The University of Manchester)/CrimeMapping/ncovr/ncovr/NAT.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3085 features and 69 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -124.7314 ymin: 24.95597 xmax: -66.96985 ymax: 49.37173
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
PUT RECAP HERE
take from rpub here
In GIS it is often difficult to present point-based data because in many instances there are several different points and data symbologies that need to be shown. As the number of different data points grows they can become complicated to interpret and manage which can result in convoluted and sometimes inaccurate maps. This becomes an even larger problem in web maps that are able to be depicted at different scales because smaller scale maps need to show more area and more data. This makes the maps convoluted if multiple data points are included.
In many maps there are so many data points included that little can be interpreted from them. In order to reduce congestion on maps many GIS users and cartographers have turned to a process known as binning.
Binning is defined as the process of grouping pairs of locations based on their distance from one another. These points can then be grouped as categories to make less complex and more meaningful maps.
Researchers and land managers often require a way to systematically divide the landscape into equal-sized portions. As well as making maps with many points easier to read, binning data into regions can help identify spatial influence of neighbourhoods, and can be an essential step in developing systematic sampling designs.
This approach to binning generates an array of repeating shapes over a user-specified area. These shapes can be hexagons, squares, rectangles, triangles, circles or points, and they can be generated with any directional orientation.
Binning is a data modification technique that changes the way data is shown at small scales. It is done in the pre-processing stage of data analysis to convert the original data values into a range of small intervals, known as a bin. These bins are then replaced by a value that is representative of the interval to reduce the number of data points.
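To make the idea of intervals and representative values concrete, here is a small non-spatial sketch in base R. The numbers are made up for illustration; cut() does the binning and tapply() computes one representative value (the mean) per bin.

```r
# Made-up values for illustration: minutes taken to respond to ten incidents.
response_minutes <- c(3, 7, 12, 18, 22, 27, 31, 38, 44, 49)

# Group the values into five intervals of width 10: [0,10), [10,20), ...
bins <- cut(response_minutes, breaks = seq(0, 50, by = 10), right = FALSE)

# How many values fall in each bin?
bin_counts <- table(bins)

# A representative value for each bin (here the mean of its members)
bin_means <- tapply(response_minutes, bins, mean)

print(bin_counts)
print(bin_means)
```

Ten individual values become five bins, each described by a count and a single summary number.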
Spatial binning (also called spatial discretization) discretizes the location values into a small number of groups associated with geographical areas or shapes. The assignment of a location to a group can be done by any of the following methods:
- Using the coordinates of the point to identify which “bin” it belongs to.
- Using a common variable in the attribute table of the bin and the point layers.
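The two methods can be sketched in base R as follows. The data frame, its column names and the 1 x 1 grid are all made up for illustration; with real data the "area" column would be something like an LSOA code.

```r
# Made-up point data: coordinates plus a neighbourhood code for each incident.
incidents <- data.frame(
  x    = c(0.2, 0.8, 1.5, 1.7, 2.4, 0.4),
  y    = c(0.3, 0.6, 0.2, 1.9, 1.1, 1.6),
  area = c("A", "A", "B", "C", "B", "C")
)

# Method 1: use the coordinates. Each point falls in the 1 x 1 grid square
# whose column and row are found by rounding the coordinates down.
cell_size <- 1
incidents$cell <- paste(floor(incidents$x / cell_size),
                        floor(incidents$y / cell_size), sep = "_")
counts_by_cell <- table(incidents$cell)

# Method 2: use a variable that the points and the bins have in common.
counts_by_area <- table(incidents$area)

print(counts_by_cell)
print(counts_by_area)
```

The first method needs nothing but the coordinates, while the second relies on someone having already recorded which area each point belongs to.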
Binning itself is a general term used to describe the grouping of a dataset’s values into smaller groups (Johnson, 2011). The bins can be based on a variety of factors and attributes such as spatial and temporal and can thus be used for many different projects.
You might be thinking, “grouping points into a larger spatial unit, haven’t we already done this when making choropleth maps?”. In a way you are right. Choropleth maps are another type of map that uses binning. Proportional symbol and choropleth maps group similar data points together to show a range of data instead of many individual points. We’ve covered this extensively, and it is generally the best approach to spatially grouping your point data, because the polygons (shapes) into which you are aggregating your points are meaningful. You can group into LSOAs because you want to show variation between neighbourhoods. Or you can group into police force areas because you want to look at differences between those units of analysis. But sometimes there is just no existing geography that meets your needs.
Let’s say you are conducting some days of action in Manchester city centre, focusing on antisocial behaviour. You are going to put up some information booths and staff them with officers to engage with the local population about antisocial behaviour. For these to be most effective, as an analyst you decide that they should go into the areas with the highest count of antisocial behaviour. You want to be very specific about where you put these as well, so LSOA level would be too broad; you want to zoom in more. One approach can be to split central Manchester into some smaller polygons, and just calculate the number of antisocial behaviour incidents recorded in each. That way you can then decide to put your information booths somewhere inside the 5 bins with the highest counts.
Rectangular binning is the simplest binning method and as such it is heavily used.
Put in some examples of Rectangular binning here. Also reasons why it’s not the best (and why hexagonnal might be better).
In many applications binning is done using a technique called hexagonal binning. This technique uses hexagon shapes to create a grid of points and develops a spatial histogram that shows different data points as a range or group of pairs with common distances and directions. In hexagonal binning, the number of points falling within a particular rectangle or hexagon in a gridded surface determines its color, which makes the data easy to visualize (Smith, 2012). Hexagonal binning was first developed in 1987 and today “hexbinning” is conducted by laying a hexagonal grid on top of 2-dimensional data (Johnson, 2011). Once this is done users can conduct data point counts to determine the number of points for each hexagon (Johnson, 2011). The bins are then symbolized differently to show meaningful patterns in the data.
So how can we use hexbinning to solve our antisocial behaviour days of action task? Well let’s say we split Manchester city centre into hexagons, and count the number of antisocial behaviour instances in these. We can then identify the top hexagons, and locate our booths somewhere within these.
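Before we let a package do the work, here is a rough base R sketch of that idea: assign each point to the nearest hexagon centre, count the points per hexagon, and pick the top five. Everything here is an assumption made for illustration: hex_centres is a helper I wrote for this sketch (it is not how the hexbin package is implemented), and the points are simulated rather than real ASB data.

```r
# Assign each point to the nearest centre of a hexagonal grid.
# `hex_centres` is a hypothetical helper written for this sketch.
# `width` is the distance between neighbouring hexagon centres in a row.
hex_centres <- function(x, y, width) {
  row_height <- width * sqrt(3) / 2          # vertical gap between rows
  # Candidate centre on the rows that are not offset
  ax <- round(x / width) * width
  ay <- round(y / (2 * row_height)) * 2 * row_height
  # Candidate centre on the rows that are shifted by half a hexagon
  bx <- (floor(x / width) + 0.5) * width
  by <- (floor(y / (2 * row_height)) * 2 + 1) * row_height
  # Keep whichever candidate is closer to the point
  use_a <- (x - ax)^2 + (y - ay)^2 <= (x - bx)^2 + (y - by)^2
  data.frame(cx = ifelse(use_a, ax, bx), cy = ifelse(use_a, ay, by))
}

# Simulated incident locations (not real data): a tight cluster plus background
set.seed(1)
x <- c(rnorm(200, mean = 2, sd = 0.2), runif(100, 0, 5))
y <- c(rnorm(200, mean = 3, sd = 0.2), runif(100, 0, 5))

# Count the points that share each hexagon centre
centres <- hex_centres(x, y, width = 0.5)
hex_counts <- aggregate(n ~ cx + cy, data = cbind(centres, n = 1), FUN = sum)

# The five hexagons with the most points
top5 <- head(hex_counts[order(-hex_counts$n), ], 5)
print(top5)
```

The hexagons at the top of this table are where the simulated cluster sits, which is exactly the kind of answer we want for placing the booths.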
First make sure you have the appropriate packages loaded:
library(ggplot2)
library(ggmap)
library(hexbin)
Also let’s get some data. You could go and get this data yourself from police.uk, we’ve been through all the steps for downloading data from there a few times now. But for now, I have a tidied set of data ready for you. This data is one year’s worth of antisocial behaviour from the police.uk data, from May 2016 to May 2017, for the borough of Manchester. You can download from the dropbox link here:
REPLACE WITH THE DROPBOX LINK WHEN ONLINE!!!!
manchester_asb <- read.csv("/Users/reka/Dropbox (The University of Manchester)/31152_60142 GIS and Crime Mapping/2018_labs/data/manchester_asb.csv")
As a first step, we can plot ASB in the borough of Manchester using simple ggplot! Remember the data visualisation session from weeks ago? We discussed how ggplot is such a great tool for building visualisations, because you can apply whatever geometry best suits your data. So for us to just have a look at the hexbinned version of our point data of antisocial behaviour, we can use the stat_binhex() function. We can also recreate the thematic map element, as we can use the frequency of points in each hex to shade each hexbin from white (lowest number of incidents) to red (highest number of incidents).
So let’s have a go:
ggplot(manchester_asb, aes(x = Longitude, y = Latitude)) + #define data and variables for x and y axes
  stat_binhex() + #add binhex layer (hexbin)
  scale_fill_gradientn(colours = c("white","red"), name = "Frequency") #add shading based on number of ASB incidents
Neat, but it doesn’t quite tell us where that really dark hexbin actually is. So it would be much better if we could do this with a basemap as the background, rather than our grey ggplot theme. To do this, we can make use of ggmap. First download the base layer for Manchester using ggmap’s get_map() function:
# get the basemap
mcr_basemap <- get_map(location="Manchester, UK", zoom=12, maptype = 'roadmap')
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manchester,+UK&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manchester,%20UK&sensor=false
# take a look
ggmap(mcr_basemap)
Now, we can apply the same code as we used above, for the ggplot, to this ggmap, to add our hexbins on top of this basemap:
ggmap(mcr_basemap) + #load basemap
  coord_cartesian() + #make sure our coords are in a cartesian coordinate system. To learn more see ?coord_cartesian()
  stat_binhex(data = manchester_asb, #use stat_binhex with arguments like in ggplot
              aes(x = Longitude,
                  y = Latitude)) +
  scale_fill_gradientn(colours = c("white","red"), #use scale_fill_gradientn with arguments like in ggplot
                       name = "Frequency")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Warning: Removed 4183 rows containing non-finite values (stat_binhex).
## Warning: Removed 6 rows containing missing values (geom_hex).
Now this should give you some more context! Woo!
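You may be wondering about the warning that several thousand rows were removed. The most likely reason (this is my reading of the warning, not something we have checked here) is that those points either have missing coordinates or fall outside the extent of the basemap, since the data cover the whole borough and the basemap does not. Here is a small sketch of how you could count such points yourself. The coordinates and the bounding box are made up for illustration.

```r
# Made-up coordinates standing in for the Longitude/Latitude columns
pts <- data.frame(
  Longitude = c(-2.24, -2.30, -2.10, -2.25, NA),
  Latitude  = c(53.48, 53.40, 53.47, 53.60, 53.48)
)

# Hypothetical extent of a basemap (left, right, bottom, top)
bbox <- c(left = -2.35, right = -2.15, bottom = 53.42, top = 53.55)

# TRUE for points with complete coordinates that lie within the extent
inside <- !is.na(pts$Longitude) & !is.na(pts$Latitude) &
  pts$Longitude >= bbox[["left"]]   & pts$Longitude <= bbox[["right"]] &
  pts$Latitude  >= bbox[["bottom"]] & pts$Latitude  <= bbox[["top"]]

n_dropped  <- sum(!inside)   # rows a plot limited to this extent would lose
pts_inside <- pts[inside, ]

print(n_dropped)
```

If a lot of your points are being dropped, that is worth knowing before you decide where the highest counts are.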
Now, to illustrate the differences between approaches, let’s see what this map would look like with rectangular binning, using stat_bin2d(), and then with a density surface, using stat_density2d():
ggmap(mcr_basemap) +
stat_bin2d(data = manchester_asb, aes(x = Longitude,
y = Latitude)) +
scale_fill_gradientn(colours = c("white","red"),
name = "Frequency")
## Warning: Removed 4183 rows containing non-finite values (stat_bin2d).
ggmap(mcr_basemap) +
stat_density2d(data = manchester_asb, aes(x = Longitude,
y = Latitude,
fill = ..level.., # value corresponding to discretized density estimates
alpha = ..level..),
geom = "polygon") + # creates the bands of different colours
## Configure the colors, transparency and panel
scale_fill_gradientn(colours = c("white","red"),
name = "Frequency")
## Warning: Removed 4183 rows containing non-finite values (stat_density2d).
Look at the difference between the three maps (hex, rectangle, and density). How would your conclusions change if you were given these maps? Would you make different decisions about where to place your booths for the days of action? Why or why not? Discuss.
Multivariate binning is another binning method that lets you visualise slightly more complex data. In this method there can be many different variables consisting of different types of data. As with other binning methods, the data in each bin is typically summarised by its sum or average. Different types of symbology (such as size, shape and color) can then be used to represent this data.
We won’t be covering this here, but you can have a look at some examples here: ADD EXAMPLES HERE
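As a rough illustration of summarising more than one variable per bin, the sketch below computes both a count and an average for each grid cell. The cell labels and the "minutes" variable are made up for illustration.

```r
# Made-up incidents: grid cell, and minutes until an officer attended
incidents <- data.frame(
  cell    = c("c1", "c1", "c1", "c2", "c2", "c3"),
  minutes = c(10, 20, 30, 5, 15, 40)
)

# Two summaries per bin: number of incidents and mean attendance time
per_cell <- data.frame(
  cell         = names(table(incidents$cell)),
  n_incidents  = as.vector(table(incidents$cell)),
  mean_minutes = as.vector(tapply(incidents$minutes, incidents$cell, mean))
)

print(per_cell)
```

On a map, you could then show one of these summaries with colour and the other with symbol size.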
Because of the plethora of data types available and the wide variety of projects being done in GIS, binning is a popular method for mapping complex data and making it meaningful. Binning is a good option for map makers as well as users because it makes data easy to understand and it can be both static and interactive on many different map scales. If every different point were shown on a map it would have to be a very large scale map to ensure that the data points did not overlap and were easily understood by people using the maps.
According to Kenneth Field, an Esri Research Cartographer, “Data binning is a great alternative for mapping large point-based data sets which allows us to tell a better story without interpolation. Binning is a way of converting point-based data into a regular grid of polygons so that each polygon represents the aggregation of points that fall within it.”
By using binning to create categories of data, maps become easier to understand, more accurate and more visually appealing.
Hexbin plots can be viewed as an alternative to scatter plots. The hexagon-shaped bins were introduced to plot densely packed sunflower plots. They can be used to plot scatter plots with high-density data.
The Modifiable Areal Unit Problem (MAUP) is an important issue for those who conduct spatial analysis using units of analysis at aggregations higher than incident level. It is one of the better-known problems in geography and spatial analysis. This phenomenon illustrates both the need for considering space in one’s analysis, and the fundamental uncertainties that accompany real-world analysis.
The MAUP is “a problem arising from the imposition of artificial units of spatial reporting on continuous geographical phenomena resulting in the generation of artificial spatial patterns” (Heywood, 1988). In other words, artifacts or errors are created when one groups data into units for analysis.
The classic text on MAUP is the 1983 paper by Openshaw.
There are two distinct types of MAUP: Scale (i.e. determining the appropriate size of units for aggregation) and zone (i.e. drawing boundaries or grouping).
The scale problem involves results that change based on data that are analyzed at higher or lower levels of aggregation (Changing the number of units). For example, evaluating data at the state level vs. Census tract level.
The scale problem has moved to the forefront of geographical criminology as a result of the recent interest in small-scale geographical units of analysis, most notably when employing hot-spot analysis of crime at micro-places (Sherman 1995; Weisburd et al. 2004, 2006, 2009b, 2010, 2014b). It has been suggested that smaller is better since small areas can be directly perceived by individuals and are likely to be more homogenous than larger areas (Oberwittler and Wikström 2009).
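A toy example can show the scale problem in action. In the sketch below the same made-up points are counted in 1 x 1 cells and then in 2 x 2 cells, and the busiest cell ends up in a different part of the study area. grid_counts is a helper name I invented for this sketch.

```r
# Count points in square grid cells of a given size.
# `grid_counts` is a hypothetical helper written for this sketch.
grid_counts <- function(x, y, cell_size) {
  col <- floor(x / cell_size)
  row <- floor(y / cell_size)
  table(paste(col, row, sep = "_"))
}

# Made-up incident locations on a 4 x 4 study area
x <- c(0.9, 1.1, 0.9, 1.1, 1.2, 2.5, 2.6, 2.7, 3.5)
y <- c(0.9, 0.9, 1.1, 1.1, 1.2, 2.5, 2.6, 2.7, 3.5)

fine   <- grid_counts(x, y, cell_size = 1)   # up to 16 cells
coarse <- grid_counts(x, y, cell_size = 2)   # up to 4 cells

# Label (column_row) of the cell with the most points at each scale
busiest_fine   <- names(fine)[which.max(fine)]
busiest_coarse <- names(coarse)[which.max(coarse)]

print(fine)
print(coarse)
```

At the fine scale the group of points near (1, 1) is split across four cells, so the cell around (2.5, 2.5) looks busiest. At the coarse scale those split points are reunited in one cell, and the hotspot moves.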
The zonal problem involves keeping the same scale of research (say, at the state level) but changing the actual shape and size of those areas.
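The zonal problem can be sketched just as simply. Here I keep the zone size fixed at 2 units and only move the boundaries. To keep it short the example is one-dimensional (think of positions along a street), and the positions are made up.

```r
# Made-up incident positions along a street (in km)
position <- c(1.8, 1.9, 2.1, 2.2, 3.2, 3.3, 3.4)

# Two zoning schemes with identical zone size but different boundaries
zones_a <- cut(position, breaks = c(0, 2, 4), right = FALSE)  # [0,2) and [2,4)
zones_b <- cut(position, breaks = c(1, 3, 5), right = FALSE)  # [1,3) and [3,5)

counts_a <- table(zones_a)
counts_b <- table(zones_b)

print(counts_a)
print(counts_b)
```

Under the first scheme the second zone has more than twice as many incidents as the first. Under the second scheme the first zone comes out on top, and the difference between zones is much smaller. The points have not moved at all.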
The basic issue with the MAUP is that aggregate units of analysis are often arbitrarily produced by whoever is in charge of creating them. A classic example of this problem is known as Gerrymandering. Gerrymandering involves shaping and re-shaping voting districts based on the political affiliations of the resident citizenry. The problem was first described by Openshaw (1984) when he stated “the areal units (zonal objects) used in many geographical studies are arbitrary, modifiable, and subject to the whims and fancies of whoever is doing, or did, the aggregating.”
The inherent problem with the MAUP and with situations such as Gerrymandering is that units of analysis are not based on geographic principles, and instead are based on political and social biases. For researchers and practitioners the MAUP has very important implications for research findings, because as arbitrarily defined units of analysis change shape, findings based on these units may change as well.
When spatial data are derived from counting or averaging data within areal units, the form of those areal units affects the data recorded, and any statistical measures derived from the data. Modifying the areal units therefore changes the data. Two effects are involved: a zoning effect arising from the particular choice of areas at a given scale; and an aggregation effect arising from the extent to which data are aggregated over smaller or larger areas. The modifiable areal unit problem arises in part from edge effect.
The practical implications of MAUP are immense for almost all decision-making processes involving GIS technology, since with the availability of aggregated maps, policy could easily focus on issues and problems which might look different if the aggregation scheme used were changed (O’Sullivan & Unwin, 2010).
All studies based on geographical areas are susceptible to the modifiable areal unit problem (MAUP), which states that the way areas have been constructed can have a major impact on results (Openshaw 1984; see also Openshaw 1996; Flowerdew 2011; Tita and Radil 2010; Oberwittler and Wikström 2009; Lupton 2003; Hipp and Boessen 2013; Hipp 2007).
The implications of the MAUP affect potentially any area level data, whether direct measures or complex model-based estimates. Here are a few examples of situations where the MAUP is expected to make a difference:
Also add gerrymandering discussion from elections recent debate etc. While much of the discussion has been by political scientists around election districts, MAUP is important to consider in relation to criminology.
Most often you will just have to remain aware of the MAUP and its possible effects. There are some techniques that can help, for example to address edge effects. It is also possible to use an alternative, zone-free approach to mapping these crime patterns, perhaps by using kernel density estimation. Here we model the relative density of the points as a density surface: essentially a function of location (x,y) representing the relative likelihood of occurrence of an event at that point. We have covered KDE elsewhere in this course.
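To show what "a function of location (x,y)" means in practice, here is a bare-bones kernel density sketch in base R. It is only meant to illustrate the idea. kde_surface is a helper name I made up, the points are simulated, and the bandwidth of 0.5 is an arbitrary choice; in real work you would use an established implementation such as the stat_density2d() layer we used earlier.

```r
# A bare-bones 2D kernel density estimate with a Gaussian kernel.
# `kde_surface` is a hypothetical helper; `bandwidth` controls the smoothing.
kde_surface <- function(x, y, grid_x, grid_y, bandwidth) {
  # kx[i, k]: kernel weight of point k at grid position i along x (ky: along y)
  kx <- outer(grid_x, x, function(g, p) dnorm(g, mean = p, sd = bandwidth))
  ky <- outer(grid_y, y, function(g, p) dnorm(g, mean = p, sd = bandwidth))
  # Average, over all points, of the product of the two weights
  (kx %*% t(ky)) / length(x)
}

# Simulated incident locations (not real data)
set.seed(42)
x <- rnorm(100, mean = 5, sd = 1)
y <- rnorm(100, mean = 5, sd = 1)

# Evaluate the density on a regular grid of locations
grid_x <- seq(0, 10, by = 0.25)
grid_y <- seq(0, 10, by = 0.25)
dens <- kde_surface(x, y, grid_x, grid_y, bandwidth = 0.5)

# Location of the highest density on the grid
peak <- which(dens == max(dens), arr.ind = TRUE)
peak_xy <- c(grid_x[peak[1, "row"]], grid_y[peak[1, "col"]])
print(peak_xy)
```

Notice that there are no zones here at all: every location on the grid gets its own density value. The bandwidth does still have to be chosen, though, so some judgement remains.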
For the purposes of this course, it’s enough that you know of, and understand the MAUP and its implications. Always be smart when choosing your appropriate spatial unit of analysis, and when you use binning of any form, make sure you consider how and if your conclusions might change compared to another possible approach.
Look at the question for homework 10.3 about the three maps of binning and the hotspot map. Answer this question again, but now in light of what you have learned about MAUP.